Using Categorical Clustering in Schema Discovery
Authors
Abstract
Most techniques for managing relational schemas assume a given schema that adequately models the data [1]. In practice, however, the semantics of the data may evolve over time, and the schema (its table structures and constraints) is not always updated to reflect these changes [5]. A common example is the overloading of tables to store facts of different types: an order table originally designed to store service orders may come to store various product orders as a company expands the scope of its business. Similarly, the semantics (typically represented as constraints) may evolve, perhaps because new data does not share the original semantics or because the full semantics were not captured in the original legacy design. Our long-term research goal is to find techniques for discovering schemas that fit the data. In this work, we take an initial step toward this goal. Specifically, we examine the benefits of using categorical clustering to discover groupings of tuples that share similar structural characteristics. Our work is motivated by recent studies that have used information theory to characterize good relational and nested relational (XML) schemas. Dalkilic and Robertson [4] have derived a measure for functional dependencies that quantifies the uncertainty left in one set of attributes when another set of attributes is known. In addition, Arenas and Libkin [3] have introduced an information-theoretic measure of well-designed database tables that takes into account the context in which each value appears. However, neither of these methods suggests an efficient algorithm for finding good decompositions or for suggesting a good schema for a data set. In addition, this work considers dependencies that strictly hold in relations, and so the techniques are not immediately applicable to dirty data.
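As a rough illustration of the kind of measure attributed to Dalkilic and Robertson above, the sketch below estimates the conditional entropy H(Y | X) of one attribute set given another from the rows of a table. It is not taken from [4]: the function name `conditional_entropy`, the dict-based row representation, and the sample `zip`/`city` data are assumptions made for this example. A value of 0 bits corresponds to a dependency X -> Y that holds exactly, and a small positive value to one that holds only approximately.

```python
from collections import Counter
from math import log2


def conditional_entropy(rows, x_cols, y_cols):
    """Estimate H(Y | X) in bits from the empirical distribution of `rows`.

    rows   : list of dicts mapping attribute name -> categorical value
    x_cols : attributes assumed known (left-hand side of the dependency)
    y_cols : attributes whose remaining uncertainty is measured
    """
    n = len(rows)
    # Count how often each (X-value, Y-value) combination occurs.
    joint = Counter(
        (tuple(r[c] for c in x_cols), tuple(r[c] for c in y_cols)) for r in rows
    )
    # Marginal counts for the X-values alone.
    x_counts = Counter()
    for (x, _), cnt in joint.items():
        x_counts[x] += cnt
    # H(Y | X) = - sum over (x, y) of p(x, y) * log2 p(y | x)
    h = 0.0
    for (x, _), cnt in joint.items():
        h -= (cnt / n) * log2(cnt / x_counts[x])
    return h


# Hypothetical data: zip -> city holds exactly in `clean`,
# but one tuple violates it in `dirty`.
clean = [{"zip": "A", "city": "X"}] * 4
dirty = [{"zip": "A", "city": "X"}] * 3 + [{"zip": "A", "city": "Y"}]

h_clean = conditional_entropy(clean, ["zip"], ["city"])
h_dirty = conditional_entropy(dirty, ["zip"], ["city"])
```

On this toy input the clean table gives 0 bits, while the single violating tuple raises the estimate to roughly 0.81 bits, which is the sense in which such a measure can grade a dependency rather than only accept or reject it.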
In real data, dependencies are often approximate, i.e., there are a small number of tuples or values for which a given dependency does not hold. We believe that such dependencies do exist in large legacy databases and that they can be used to help understand the semantics of a data set. As a first step in discovering structure, we have developed an algorithm that groups similar records of relational tables containing categorical data. While clustering is a very well-studied problem, the techniques for clustering categorical data suffer from a number of limitations that make them unsuitable for use in a schema discovery application. For example, current categorical clustering algorithms do not scale to the large multi-attribute relational tables we consider. To address this and other limitations, we have introduced LIMBO [2], a scalable hierarchical categorical clustering algorithm. LIMBO builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering [7]. We use it to cluster both tuples and attribute values. While the IB method has been applied before to cluster small data sets [6], LIMBO is the first scalable hierarchical clustering algorithm to use this method. It supports a tradeoff between computation time and clustering quality, handles large data sets efficiently, and is robust to the order in which the input is presented. Finally, LIMBO produces clusterings for a range of k values (where k is the number of clusters). We take advantage of this feature to examine heuristics for selecting good clusterings within this range. This property is very important for schema discovery, since from the many clusterings produced in a single application of the algorithm we can pick a k value that yields a good schema design. Our objective is to produce informative clusters, i.e., clusters that convey maximum information about their attribute values.
That is, given a cluster, we wish to predict accurately the attribute values associated with the tuples of the cluster. The quality measure of the clustering is then the mutual information of the clusters and the attribute values. Since a clustering is a summary of the data, some information is generally lost. Our objective will be to minimize this loss, or equivalently to minimize the increase in uncertainty as the tuples are grouped into fewer and larger clusters. In this work, we will report on our experience using LIMBO to find clusterings with good structural characteristics. That is, we plan to use LIMBO to find tuple clusters and evaluate their structural characteristics. The primary focus of our preliminary research will be to demonstrate the advantages (or disadvantages) of using the loss of information as a guide to discovering relational structure.
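The quality measure described above can be made concrete with a small sketch that computes the mutual information I(C; V) between clusters and attribute values, and the information lost when two clusters are merged. This is an illustration only, not the LIMBO implementation: it assumes every tuple is equally likely and spreads its probability uniformly over its attribute values, and the names `mutual_information` and `merge_loss` and the sample order data are hypothetical.

```python
from collections import defaultdict
from math import log2


def mutual_information(tuples, assignment):
    """I(C; V) in bits between cluster labels and attribute values.

    tuples     : list of equal-length tuples of categorical values
    assignment : list of cluster labels, one per tuple
    Assumption : p(t) = 1/n and p(v | t) = 1/m for each of the m values of t.
    """
    n = len(tuples)
    joint = defaultdict(float)  # p(c, v)
    for t, c in zip(tuples, assignment):
        m = len(t)
        for pos, val in enumerate(t):
            # Tag each value with its attribute position so that equal
            # strings in different columns are treated as different values.
            joint[(c, (pos, val))] += 1.0 / (n * m)
    p_c = defaultdict(float)
    p_v = defaultdict(float)
    for (c, v), p in joint.items():
        p_c[c] += p
        p_v[v] += p
    return sum(p * log2(p / (p_c[c] * p_v[v])) for (c, v), p in joint.items())


def merge_loss(tuples, assignment, a, b):
    """Information lost (in bits) when cluster b is merged into cluster a."""
    merged = [a if c == b else c for c in assignment]
    return mutual_information(tuples, assignment) - mutual_information(tuples, merged)


# Hypothetical overloaded order table: two service orders, two product orders.
orders = [
    ("service", "hourly"),
    ("service", "hourly"),
    ("product", "unit"),
    ("product", "unit"),
]
singletons = [0, 1, 2, 3]  # start with one cluster per tuple

loss_similar = merge_loss(orders, singletons, 0, 1)    # two service orders
loss_different = merge_loss(orders, singletons, 0, 2)  # service with product
```

With these inputs, merging the two identical service orders loses 0 bits, while merging a service order with a product order loses 0.5 bits. Repeatedly choosing the cheapest merge is one simple way to obtain a clustering for every k in a single pass, in the spirit of the agglomerative use of information loss described in the abstract.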
Similar resources
Presenting a clustering algorithm for categorical data by combining measures
Clustering is one of the main techniques in data mining. Clustering is a process that partitions a data set into groups. In clustering, the data in a cluster are the closest to each other and the data in two different clusters differ the most. Clustering algorithms are divided into two categories according to the type of data: Clustering algorithms for numerical data and clustering algor...
Clustering Structured Web Sources: A Schema-Based, Model-Differentiation Approach
The Web has been rapidly “deepened” with the prevalence of databases online. On this “deep Web,” numerous sources are structured, providing schema-rich data: their schemas define the object domain and its query capabilities. This paper proposes clustering sources by their query schemas, which is critical for enabling both source selection and query mediation, by organizing sources of with simil...
ECCLAT: a New Approach of Clusters Discovery in Categorical Data
In this paper we present a new approach for the discovery of meaningful clusters from large categorical data (which is a usual situation, e.g., web data analysis). Our method, called Ecclat (for Extraction of Clusters from Concepts LATtice), extracts a subset of concepts from the frequent closed itemsets lattice, using an evaluation measure. Ecclat is generic because it allows to build approxima...
Concept clustering for cooperation
Most heterogeneous Clinical Information Systems share a strong semantic similarity in spite of their autonomy and heterogeneity. To exploit this semantic similarity, we propose a concept discovery approach based on statistical clustering techniques to develop a generic conceptual schema to establish interoperability. The usage pattern of users is not just statistical information, but carries co...
Using Categorical Attributes for Clustering
The traditional clustering algorithms focused on clustering numeric data by exploiting the inherent geometric properties of the dataset for calculating distance functions between the points to be clustered. The distance based approach did not fit into clustering real life data containing categorical values. The focus of research then shifted to clustering such data and various categorical clust...
Publication date: 2003